## **How to Write Fast Numerical Code**

Spring 2011 Lecture 17

Instructor: Markus Püschel

**TA:** Georg Ofenbeck



Eidgenössische Technische Hochschule Zürich Swiss Federal Institute of Technology Zurich

## **SIMD Extensions and SSE**

- Overview
- SSE family, floating point, and x87
- SSE intrinsics
- Compiler vectorization

This material was developed together with Franz Franchetti, Carnegie Mellon.

# SIMD (Single Instruction Multiple Data) Vector Extensions

#### What is it?

 Extension of the ISA. Data types and instructions for the parallel computation on short (length 2-8) vectors of integers or floats



Names: MMX, SSE, SSE2, ...

#### Why do they exist?

- Useful: Many applications have the necessary fine-grain parallelism
   Then: speedup by a factor close to vector length
- Doable: Chip designers have enough transistors to play with

- MMX: Multimedia extension
- SSE: Streaming SIMD extension
- AVX: Advanced vector extensions

(Figure: timeline of the extensions; horizontal axis: time.)

# **SSE Family: Floating Point**



- Not drawn to scale
- From SSE3: Only additional instructions
- Every Core 2 has SSE3

# **Overview Floating-Point Vector ISAs**

| Vendor   | Name                                                      |     | ν-way                            | Precision                            | Introduced with                                                                                  |
|----------|-----------------------------------------------------------|-----|----------------------------------|--------------------------------------|--------------------------------------------------------------------------------------------------|
| Intel    | SSE<br>SSE2<br>SSE3<br>SSSE3<br>SSE4<br>AVX               | +   | 4-way<br>2-way<br>8-way<br>4-way | single<br>double<br>single<br>double | Pentium III<br>Pentium 4<br>Pentium 4 (Prescott)<br>Core Duo<br>Core2 Extreme (Penryn)<br>Core i7 (Sandybridge) |
| Intel    | IPF                                                       |     | 2-way                            | single                               | Itanium                                                                                          |
| Intel    | LRB                                                       |     | 16-way<br>8-way                  | single<br>double                     | Larrabee                                                                                         |
| AMD      | 3DNow!<br>Enhanced 3DNow!<br>3DNow! Professional<br>AMD64 | +++ | 2-way<br>4-way<br>2-way          | single<br>single<br>double           | K6<br>K7<br>Athlon XP<br>Opteron                                                                 |
| Motorola | AltiVec                                                   |     | 4-way                            | single                               | MPC 7400 G4                                                                                      |
| IBM      | VMX<br>SPU                                                | +   | 4-way<br>2-way                   | single<br>double                     | PowerPC 970 G5<br>Cell BE                                                                        |
| IBM      | Double FPU                                                |     | 2-way                            | double                               | PowerPC 440 FP2                                                                                  |

Within an extension family, newer generations add features to older ones.

Convergence: 3DNow! Professional = 3DNow! + SSE; VMX = AltiVec

## Core 2

- Has SSE3
- 16 SSE registers



# **SSE3** Registers





### Integer vectors:

- 16-way byte
- 8-way 2 bytes
- 4-way 4 bytes

(Figure: a 128-bit register divided into elements, with the LSB marked.)

### Floating point vectors:

- 4-way single (since SSE)
- 2-way double (since SSE2)

### Floating point scalars:

- single (since SSE)
- double (since SSE2)

# **SSE3** Instructions: Examples

- Single precision 4-way vector add: addps %xmm0, %xmm1



- Single precision scalar add: addss %xmm0, %xmm1
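The difference between the two instructions can also be observed from C through the corresponding intrinsics `_mm_add_ps` (addps) and `_mm_add_ss` (addss). The following is a minimal sketch; the helper names `add_packed` and `add_scalar` are hypothetical and not part of the lecture.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Packed add (addps): out[k] = a[k] + b[k] for k = 0..3. */
void add_packed(const float *a, const float *b, float *out) {
    __m128 va = _mm_loadu_ps(a);   /* unaligned loads: no alignment assumed */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}

/* Scalar add (addss): only the lowest element is added;
   the upper three elements are copied from the first operand. */
void add_scalar(const float *a, const float *b, float *out) {
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ss(va, vb));
}
```

With a = {1, 2, 3, 4} and b = {10, 20, 30, 40}, the packed version yields {11, 22, 33, 44}, while the scalar version yields {11, 2, 3, 4}.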



## **SSE3 Instruction Names**







The compiler uses the scalar (single-slot) SSE instructions, such as addss, for floating point

- on x86-64
- with proper flags if SSE/SSE2 is available

# x86-64 FP Code Example

#### Inner product of two vectors

- Single precision arithmetic
- Compiled: uses SSE instructions

```
ipf:
  xorps  %xmm1, %xmm1            # result = 0.0
  xorl   %ecx, %ecx              # i = 0
  jmp    .L8                     # goto middle
.L10:                            # loop:
  movslq %ecx, %rax              # icpy = i
  incl   %ecx                    # i++
  movss  (%rsi,%rax,4), %xmm0    # t = y[icpy]
  mulss  (%rdi,%rax,4), %xmm0    # t *= x[icpy]
  addss  %xmm0, %xmm1            # result += t
.L8:                             # middle:
  cmpl   %edx, %ecx              # i:n
  jl     .L10                    # if < goto loop
  movaps %xmm1, %xmm0            # return result
  ret
```
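For orientation, C source along the following lines would plausibly compile to the assembly above. This is a sketch: the original source and compiler flags are not shown on the slide, and the function name is simply taken from the assembly label `ipf`.

```c
/* Inner product of two single-precision vectors of length n.
   In the assembly above: x is in %rdi, y in %rsi, n in %edx. */
float ipf(const float x[], const float y[], int n) {
    float result = 0.0f;
    for (int i = 0; i < n; i++)
        result += x[i] * y[i];   /* one mulss and one addss per iteration */
    return result;
}
```

Note that the compiled loop uses only the scalar instructions mulss and addss, so three of the four slots in each register stay unused.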

# The Other Floating Point (x87)

#### History

- 8086: first processor to implement IEEE FP (in the separate 8087 FPU = floating point unit)
- Logically stack based
- 486: merged FPU and Integer Unit onto one chip
- Default on x86-32 (since SSE is not guaranteed)
- Became obsolete with x86-64

#### Floating Point Formats

- single precision (C float): 32 bits
- double precision (C double): 64 bits
- extended precision (C long double): 80 bits



# **x87 FPU Instructions and Register Stack**

- Sample instructions:
  - flds (load single precision)
  - fmuls (mult single precision)
  - faddp (add and pop)

- 8 registers: %st(0) to %st(7)
- Logically form a stack
- Top: %st(0)



Bottom disappears (drops out) after too many pushes

# FP Code Example (x87)

#### Inner product of two vectors

Single precision arithmetic

```
  pushl %ebp                 # setup
  movl  %esp, %ebp
  pushl %ebx
  movl  8(%ebp), %ebx        # %ebx = &x
  movl  12(%ebp), %ecx       # %ecx = &y
  movl  16(%ebp), %edx       # %edx = n
  fldz                       # push +0.0
  xorl  %eax, %eax           # i = 0
  cmpl  %edx, %eax           # if i >= n done
  jge   .L3
.L5:
  flds  (%ebx,%eax,4)        # push x[i]
  fmuls (%ecx,%eax,4)        # st(0) *= y[i]
  faddp                      # st(1) += st(0); pop
  incl  %eax                 # i++
  cmpl  %edx, %eax           # if i < n repeat
  jl    .L5
.L3:
  movl  -4(%ebp), %ebx       # finish
  movl  %ebp, %esp
  popl  %ebp
  ret                        # st(0) = result
```

## From Core 2 Manual

| Operation                    | Latency, throughput | Latency, throughput | Comment                                |
|------------------------------|---------------------|---------------------|----------------------------------------|
| Single-precision (SP) FP MUL | 4, 1                | 4, 1                | Issue port 0; Writeback port 0         |
| Double-precision FP MUL      | 5, 1                | 5, 1                |                                        |
| FP MUL (X87)                 | 5, 2                | 5, 2                | Issue port 0; Writeback port 0         |
| FP Shuffle                   | 1, 1                | 1, 1                | FP shuffle does not handle QW shuffle. |
| DIV/SQRT                     |                     |                     |                                        |

SSE based FP: the single- and double-precision FP MUL rows. x87 FP: the FP MUL (X87) row.

## **Summary**

- On Core 2 there are two different (unvectorized) floating point instruction sets
  - x87: obsolete, is default on x86-32
  - SSE based: uses only one slot, is default on x86-64
- SIMD vector floating point instructions
  - 4-way single precision: since SSE
  - 2-way double precision: since SSE2
  - Since on Core 2 add and mult are fully pipelined (1 per cycle): possible gain 4x and 2x, respectively

# **SSE:** How to Take Advantage?



- Necessary: fine grain parallelism
- Options:
  - Use vectorized libraries (easy, not always available)
  - Write assembly
  - Use intrinsics (focus of this course)
  - Compiler vectorization (this course)
- We will focus on floating point and single precision (4-way)

## **SIMD Extensions and SSE**

- Overview
- SSE family, floating point, and x87
- SSE intrinsics
- Compiler vectorization

#### **References:**

Intel icc manual (currently 12.0) → Intrinsics reference

http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/cpp/lin/index.htm

Visual Studio Manual (also: paste the intrinsic into Google)

http://msdn.microsoft.com/de-de/library/26td21ds.aspx

# **SSE Family Intrinsics**

- Assembly coded C functions
- Expanded inline upon compilation: no overhead
- Like writing assembly inside C
- Floating point:
  - Intrinsics for math functions: log, sin, ...
  - Intrinsics for SSE

- Our introduction is based on icc
  - Most intrinsics work with gcc and Visual Studio (VS)
  - Some language extensions are icc (or even VS) specific

## **Header files**

SSE: xmmintrin.h

SSE2: emmintrin.h

SSE3: pmmintrin.h

SSSE3: tmmintrin.h

SSE4: smmintrin.h and nmmintrin.h

or ia32intrin.h

## Visual Conventions We Will Use

#### Memory

(Figure: memory cells drawn in order of increasing address.)

#### Registers

- Before (and common): (register figure)
- Now we will use: (register figure)

# **SSE Intrinsics (Focus Floating Point)**

#### Data types

```
__m128 f;  // = {float f0, f1, f2, f3}
__m128d d; // = {double d0, d1}
__m128i i; // 16 8-bit, 8 16-bit, 4 32-bit, or 2 64-bit ints
```

# **SSE Intrinsics (Focus Floating Point)**

#### Instructions

- Naming convention: `_mm_<intrin_op>_<suffix>`
- Example:

```
// a is 16-byte aligned
float a[4] = {1.0, 2.0, 3.0, 4.0};
__m128 t = _mm_load_ps(a);
```

Suffix ps: p = packed, s = single

Same result as

```
__m128 t = _mm_set_ps(4.0, 3.0, 2.0, 1.0);
```
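The equivalence stated above can be checked with a small function. This is a sketch under two assumptions: it uses the unaligned load `_mm_loadu_ps` so that no alignment of the array is required, and the function name `load_equals_set` is hypothetical.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Returns 1 if loading {1, 2, 3, 4} from memory and building the vector
   with _mm_set_ps(4, 3, 2, 1) give the same four elements, else 0. */
int load_equals_set(void) {
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    __m128 t1 = _mm_loadu_ps(a);                     /* a[0] goes to the lowest element */
    __m128 t2 = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);  /* arguments: highest element first */
    __m128 eq = _mm_cmpeq_ps(t1, t2);                /* all ones per equal element */
    return _mm_movemask_ps(eq) == 0xF;               /* one bit per element */
}
```

The point to remember is the argument order: `_mm_set_ps` lists the highest element first, whereas memory order starts with the lowest element.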

## **SSE Intrinsics**

Native instructions (one-to-one with assembly)

```
_mm_load_ps()
_mm_add_ps()
_mm_mul_ps()
```

Multi instructions (map to several assembly instructions)

```
_mm_set_ps()
_mm_set1_ps()
```

Macros and helpers

```
_MM_TRANSPOSE4_PS()
_MM_SHUFFLE()
```
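As an illustration of the macro category, the following sketch transposes a 4 x 4 matrix with `_MM_TRANSPOSE4_PS`. The function name `transpose4x4` and the row-major layout are assumptions made for this example.

```c
#include <xmmintrin.h>  /* SSE intrinsics and the _MM_TRANSPOSE4_PS macro */

/* Transposes a 4x4 row-major matrix of floats in place. */
void transpose4x4(float *m) {
    __m128 r0 = _mm_loadu_ps(m + 0);    /* row 0 */
    __m128 r1 = _mm_loadu_ps(m + 4);    /* row 1 */
    __m128 r2 = _mm_loadu_ps(m + 8);    /* row 2 */
    __m128 r3 = _mm_loadu_ps(m + 12);   /* row 3 */
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);  /* macro: expands to several instructions */
    _mm_storeu_ps(m + 0, r0);
    _mm_storeu_ps(m + 4, r1);
    _mm_storeu_ps(m + 8, r2);
    _mm_storeu_ps(m + 12, r3);
}
```

The macro overwrites its four arguments, so they must be variables rather than expressions.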

...

## What Are the Main Issues?

- Alignment is important (128 bit = 16 byte)
- You need to code explicit loads and stores (what does that remind you of?)
- Overhead through shuffles

## **SSE Intrinsics**

- Load and store
- Constants
- Arithmetic
- Comparison
- Conversion
- Shuffles

## **Loads and Stores**

| Intrinsic Name | Operation                                          | Corresponding SSE Instructions |
|----------------|----------------------------------------------------|--------------------------------|
| _mm_loadh_pi   | Load high                                          | MOVHPS reg, mem                |
| _mm_loadl_pi   | Load low                                           | MOVLPS reg, mem                |
| _mm_load_ss    | Load the low value and clear the three high values | MOVSS                          |
| _mm_load1_ps   | Load one value into all four words                 | MOVSS + Shuffling              |
| _mm_load_ps    | Load four values, address aligned                  | MOVAPS                         |
| _mm_loadu_ps   | Load four values, address unaligned                | MOVUPS                         |
| _mm_loadr_ps   | Load four values in reverse                        | MOVAPS + Shuffling             |

| Intrinsic Name | Operation                                         | Corresponding SSE Instruction |
|----------------|---------------------------------------------------|-------------------------------|
| _mm_set_ss     | Set the low value and clear the three high values | Composite                     |
| _mm_set1_ps    | Set all four words with the same value            | Composite                     |
| _mm_set_ps     | Set four values                                   | Composite                     |
| _mm_setr_ps    | Set four values, in reverse order                 | Composite                     |
| _mm_setzero_ps | Clear all four values                             | Composite                     |
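A few of the load and set variants from the tables, written out so that the resulting element order is visible. This is a sketch; the function name `load_variants` and the output layout are hypothetical.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Applies several load/set variants to p[0..3] and writes the
   resulting vectors to out[0..15], four floats per variant. */
void load_variants(const float *p, float *out) {
    _mm_storeu_ps(out + 0,  _mm_loadu_ps(p));   /* p0 p1 p2 p3 */
    _mm_storeu_ps(out + 4,  _mm_load1_ps(p));   /* p0 p0 p0 p0 */
    _mm_storeu_ps(out + 8,  _mm_load_ss(p));    /* p0 0  0  0  */
    _mm_storeu_ps(out + 12,
                  _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f)); /* 1 2 3 4 (memory order) */
}
```

The unaligned variants are used here only so that the sketch makes no assumption about the alignment of `p` and `out`.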

## **Loads and Stores**



```
a = _mm_load_ps(p); // p 16-byte aligned
```

Unaligned load (\_mm\_loadu\_ps): avoid (expensive)

# **How to Align**

\_\_m128, \_\_m128d, \_\_m128i are 16-byte aligned

Arrays:

```
__declspec(align(16)) float g[4];
```

#### Dynamic allocation

- \_mm\_malloc() and \_mm\_free()
- Write your own malloc that returns 16-byte aligned addresses
- Some malloc's already guarantee 16-byte alignment
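A minimal sketch of dynamic allocation with `_mm_malloc()` and `_mm_free()`. The helper names `is_aligned16` and `alloc_aligned_floats` are hypothetical. With gcc, `xmmintrin.h` also provides `_mm_malloc`/`_mm_free`; other compilers may need an additional header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics; with gcc also _mm_malloc/_mm_free */

/* Hypothetical helper: returns 1 if p lies on a 16-byte boundary. */
int is_aligned16(const void *p) {
    return ((uintptr_t)p % 16) == 0;
}

/* Hypothetical helper: allocates n floats on a 16-byte boundary.
   Returns NULL on failure; release the memory with _mm_free(). */
float *alloc_aligned_floats(size_t n) {
    float *p = (float *)_mm_malloc(n * sizeof(float), 16);
    assert(p == NULL || is_aligned16(p));
    return p;
}
```

Memory obtained this way can be used with the aligned intrinsics `_mm_load_ps` and `_mm_store_ps`, and it has to be released with `_mm_free()`, not `free()`.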

# **Stores Analogous to Loads**

| Intrinsic Name | Operation                                                  | Corresponding SSE Instruction |
|----------------|------------------------------------------------------------|-------------------------------|
| _mm_storeh_pi  | Store high                                                 | MOVHPS mem, reg               |
| _mm_storel_pi  | Store low                                                  | MOVLPS mem, reg               |
| _mm_store_ss   | Store the low value                                        | MOVSS                         |
| _mm_store1_ps  | Store the low value across all four words, address aligned | Shuffling + MOVSS             |
| _mm_store_ps   | Store four values, address aligned                         | MOVAPS                        |
| _mm_storeu_ps  | Store four values, address unaligned                       | MOVUPS                        |
| _mm_storer_ps  | Store four values, in reverse order                        | MOVAPS + Shuffling            |
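The store variants can be tried out in the same way as the loads. The following is a sketch; the function name `store_variants` and the output layout are hypothetical.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Stores the vector v = {1, 2, 3, 4} (lowest element first) to out[0..8]
   using four different store variants. */
void store_variants(float *out) {
    __m128 v = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
    _mm_storeu_ps(out, v);                  /* out[0..3] = 1 2 3 4 */
    _mm_store_ss(out + 4, v);               /* out[4]    = 1 (low value only) */
    _mm_storeh_pi((__m64 *)(out + 5), v);   /* out[5..6] = 3 4 (high half) */
    _mm_storel_pi((__m64 *)(out + 7), v);   /* out[7..8] = 1 2 (low half) */
}
```

`_mm_storeh_pi` and `_mm_storel_pi` each write two floats (64 bits), which is why they take a pointer to `__m64`.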

## **Constants**

```
d = _mm_setzero_ps();
```

# **Arithmetic**

#### SSE

| Intrinsic Name | Operation               | Corresponding SSE Instruction |
|----------------|-------------------------|-------------------------------|
| _mm_add_ss     | Addition                | ADDSS                         |
| _mm_add_ps     | Addition                | ADDPS                         |
| _mm_sub_ss     | Subtraction             | SUBSS                         |
| _mm_sub_ps     | Subtraction             | SUBPS                         |
| _mm_mul_ss     | Multiplication          | MULSS                         |
| _mm_mul_ps     | Multiplication          | MULPS                         |
| _mm_div_ss     | Division                | DIVSS                         |
| _mm_div_ps     | Division                | DIVPS                         |
| _mm_sqrt_ss    | Square Root             | SQRTSS                        |
| _mm_sqrt_ps    | Square Root             | SQRTPS                        |
| _mm_rcp_ss     | Reciprocal              | RCPSS                         |
| _mm_rcp_ps     | Reciprocal              | RCPPS                         |
| _mm_rsqrt_ss   | Reciprocal Square Root  | RSQRTSS                       |
| _mm_rsqrt_ps   | Reciprocal Square Root  | RSQRTPS                       |
| _mm_min_ss     | Computes Minimum        | MINSS                         |
| _mm_min_ps     | Computes Minimum        | MINPS                         |
| _mm_max_ss     | Computes Maximum        | MAXSS                         |
| _mm_max_ps     | Computes Maximum        | MAXPS                         |

#### SSE3

| Intrinsic Name | Operation        | Corresponding SSE3 Instruction |
|----------------|------------------|--------------------------------|
| _mm_addsub_ps  | Subtract and add    | ADDSUBPS                    |
| _mm_hadd_ps    | Horizontal add      | HADDPS                      |
| _mm_hsub_ps    | Horizontal subtract | HSUBPS                      |

#### SSE4

| Intrinsic Name | Operation                    | Corresponding SSE4 Instruction |
|----------------|------------------------------|--------------------------------|
| _mm_dp_ps      | Single precision dot product | DPPS                           |

## **Arithmetic**



#### analogous:

`c = _mm_sub_ps(a, b);`

`c = _mm_mul_ps(a, b);`
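The arithmetic intrinsics compose like ordinary operators. As a sketch, the following hypothetical function `submul` computes z[i] = (x[i] - y[i]) * y[i] four elements at a time; the assumption that n is a multiple of 4 keeps the example free of cleanup code.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* z[i] = (x[i] - y[i]) * y[i] for i < n; n must be a multiple of 4. */
void submul(const float *x, const float *y, float *z, int n) {
    assert(n % 4 == 0);
    for (int i = 0; i < n; i += 4) {
        __m128 a = _mm_loadu_ps(x + i);   /* explicit loads */
        __m128 b = _mm_loadu_ps(y + i);
        __m128 c = _mm_sub_ps(a, b);      /* c = a - b, elementwise */
        c = _mm_mul_ps(c, b);             /* c = c * b, elementwise */
        _mm_storeu_ps(z + i, c);          /* explicit store */
    }
}
```

If the arrays are known to be 16-byte aligned, the unaligned loads and stores can be replaced by `_mm_load_ps` and `_mm_store_ps`.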

# **Example**

```
void addindex(float *x, int n) {
  for (int i = 0; i < n; i++)
    x[i] = x[i] + i;
}
```

```
#include <xmmintrin.h>  // SSE intrinsics (with icc, ia32intrin.h also works)

// n a multiple of 4, x is 16-byte aligned
void addindex_vec(float *x, int n) {
  __m128 index, x_vec;

  for (int i = 0; i < n/4; i++) {
    x_vec = _mm_load_ps(x+i*4);                   // load 4 values
    index = _mm_set_ps(i*4+3, i*4+2, i*4+1, i*4); // create vector with indexes
    x_vec = _mm_add_ps(x_vec, index);             // add the two
    _mm_store_ps(x+i*4, x_vec);                   // store back
  }
}
```

Note how using intrinsics implicitly forces scalar replacement!

# **Example: Better Solution**

```
void addindex(float *x, int n) {
  for (int i = 0; i < n; i++)
    x[i] = x[i] + i;
}
```
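The improved vectorized code from the slide is not reproduced in this text. One way to improve on the first version, sketched here as an assumption rather than as the lecture's exact code, is to build the index vector once and advance it by 4 in every iteration, which avoids the composite `_mm_set_ps` inside the loop. The name `addindex_vec2` is hypothetical.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* x[i] = x[i] + i; n a multiple of 4, x 16-byte aligned.
   Indexes are kept as floats, which is exact for i below 2^24. */
void addindex_vec2(float *x, int n) {
    assert(n % 4 == 0);
    __m128 ind  = _mm_set_ps(3.0f, 2.0f, 1.0f, 0.0f);  /* indexes 0, 1, 2, 3 */
    __m128 incr = _mm_set1_ps(4.0f);                   /* step per iteration */
    for (int i = 0; i < n; i += 4) {
        __m128 x_vec = _mm_load_ps(x + i);   /* load 4 values */
        x_vec = _mm_add_ps(x_vec, ind);      /* add the indexes */
        ind = _mm_add_ps(ind, incr);         /* advance indexes by 4 */
        _mm_store_ps(x + i, x_vec);          /* store back */
    }
}
```

The loop body now consists of one load, two adds, and one store, all of which map to single instructions.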

Note how using intrinsics implicitly forces scalar replacement!









# **Example**

```
// n is even
void lp(float *x, float *y, int n) {
  for (int i = 0; i < n/2; i++)
    y[i] = (x[2*i] + x[2*i+1])/2;
}
```
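A possible vectorization of this loop with plain SSE is sketched below. It is one option among several and not necessarily the one intended on the slide: two shuffles separate the even and odd elements of eight inputs, which are then added and scaled. The name `lp_vec` and the requirement that n is a multiple of 8 are assumptions of this sketch.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics and _MM_SHUFFLE */

/* y[i] = (x[2i] + x[2i+1]) / 2 for i < n/2; n must be a multiple of 8. */
void lp_vec(const float *x, float *y, int n) {
    assert(n % 8 == 0);
    const __m128 half = _mm_set1_ps(0.5f);
    for (int i = 0; i < n / 8; i++) {
        __m128 a = _mm_loadu_ps(x + 8 * i);       /* x0 x1 x2 x3 */
        __m128 b = _mm_loadu_ps(x + 8 * i + 4);   /* x4 x5 x6 x7 */
        /* gather even and odd elements of the eight inputs */
        __m128 even = _mm_shuffle_ps(a, b, _MM_SHUFFLE(2, 0, 2, 0)); /* x0 x2 x4 x6 */
        __m128 odd  = _mm_shuffle_ps(a, b, _MM_SHUFFLE(3, 1, 3, 1)); /* x1 x3 x5 x7 */
        __m128 sum  = _mm_add_ps(even, odd);
        _mm_storeu_ps(y + 4 * i, _mm_mul_ps(sum, half));  /* divide by 2 */
    }
}
```

The two shuffles are pure overhead compared to the scalar code, which illustrates the point made earlier about shuffle cost.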

```
__m128 _mm_dp_ps(__m128 a, __m128 b, const int mask)
```

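**(SSE4)** Computes the pointwise product of a and b and writes a selected sum of the resulting numbers into selected elements of the result c; the others are set to zero. The selections are encoded in the mask: the upper four bits select which products enter the sum, the lower four bits select which elements of c receive it.

**Example:** mask = 117 = 01110101 (binary)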
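To make the mask encoding concrete, here is a scalar reference model of the selection logic in plain C. It is only an illustration of the semantics (the hypothetical name `dp_ps_model` is not an intrinsic), and the order in which the real instruction adds the products, and hence its rounding, may differ.

```c
/* Scalar model of the mask semantics of _mm_dp_ps (illustration only):
   bits 4-7 of mask select which products a[k]*b[k] enter the sum,
   bits 0-3 select which elements of c receive the sum (others become 0). */
void dp_ps_model(const float a[4], const float b[4], float c[4], int mask) {
    float sum = 0.0f;
    for (int k = 0; k < 4; k++)
        if (mask & (1 << (4 + k)))      /* upper four bits: inputs */
            sum += a[k] * b[k];
    for (int k = 0; k < 4; k++)
        c[k] = (mask & (1 << k)) ? sum : 0.0f;  /* lower four bits: outputs */
}
```

For mask = 117 = 01110101, the upper bits 0111 select the products of elements 0, 1, and 2, and the lower bits 0101 write the sum to elements 0 and 2 of c.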



# **Comparisons**

| Intrinsic Name | Operation                    | Corresponding SSE Instruction |
|----------------|------------------------------|-------------------------------|
| _mm_cmpeq_ss   | Equal                        | CMPEQSS                       |
| _mm_cmpeq_ps   | Equal                        | CMPEQPS                       |
| _mm_cmplt_ss   | Less Than                    | CMPLTSS                       |
| _mm_cmplt_ps   | Less Than                    | CMPLTPS                       |
| _mm_cmple_ss   | Less Than or Equal           | CMPLESS                       |
| _mm_cmple_ps   | Less Than or Equal           | CMPLEPS                       |
| _mm_cmpgt_ss   | Greater Than                 | CMPLTSS                       |
| _mm_cmpgt_ps   | Greater Than                 | CMPLTPS                       |
| _mm_cmpge_ss   | Greater Than or Equal        | CMPLESS                       |
| _mm_cmpge_ps   | Greater Than or Equal        | CMPLEPS                       |
| _mm_cmpneq_ss  | Not Equal                    | CMPNEQSS                      |
| _mm_cmpneq_ps  | Not Equal                    | CMPNEQPS                      |
| _mm_cmpnlt_ss  | Not Less Than                | CMPNLTSS                      |
| _mm_cmpnlt_ps  | Not Less Than                | CMPNLTPS                      |
| _mm_cmpnle_ss  | Not Less Than or Equal       | CMPNLESS                      |
| _mm_cmpnle_ps  | Not Less Than or Equal       | CMPNLEPS                      |
| _mm_cmpngt_ss  | Not Greater Than             | CMPNLTSS                      |
| _mm_cmpngt_ps  | Not Greater Than             | CMPNLTPS                      |
| _mm_cmpnge_ss  | Not Greater Than or Equal    | CMPNLESS                      |
| _mm_cmpnge_ps  | Not Greater Than or Equal    | CMPNLEPS                      |

| Intrinsic Name  | Operation             | Corresponding SSE Instruction |
|-----------------|-----------------------|-------------------------------|
| _mm_cmpord_ss   | Ordered               | CMPORDSS                      |
| _mm_cmpord_ps   | Ordered               | CMPORDPS                      |
| _mm_cmpunord_ss | Unordered             | CMPUNORDSS                    |
| _mm_cmpunord_ps | Unordered             | CMPUNORDPS                    |
| _mm_comieq_ss   | Equal                 | COMISS                        |
| _mm_comilt_ss   | Less Than             | COMISS                        |
| _mm_comile_ss   | Less Than or Equal    | COMISS                        |
| _mm_comigt_ss   | Greater Than          | COMISS                        |
| _mm_comige_ss   | Greater Than or Equal | COMISS                        |
| _mm_comineq_ss  | Not Equal             | COMISS                        |
| _mm_ucomieq_ss  | Equal                 | UCOMISS                       |
| _mm_ucomilt_ss  | Less Than             | UCOMISS                       |
| _mm_ucomile_ss  | Less Than or Equal    | UCOMISS                       |
| _mm_ucomigt_ss  | Greater Than          | UCOMISS                       |
| _mm_ucomige_ss  | Greater Than or Equal | UCOMISS                       |
| _mm_ucomineq_ss | Not Equal             | UCOMISS                       |

# **Comparisons**



#### analogous:

the other packed comparisons listed above (\_mm\_cmplt\_ps, \_mm\_cmple\_ps, \_mm\_cmpge\_ps, etc.)

#### Each field:

0xffffffff if true, 0x0 if false

Return type: \_\_m128
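
Because each field of the result is all ones or all zeros, a comparison result can serve directly as a bit mask. A minimal sketch, with the hypothetical helper name `zero_negatives`:

```c
#include <xmmintrin.h>  // SSE

// Hypothetical helper: replaces negative entries of x[0..3] by 0 without
// branching and returns a 4-bit mask (bit i set if x[i] was negative).
int zero_negatives(float *x) {
  __m128 v    = _mm_loadu_ps(x);
  __m128 mask = _mm_cmplt_ps(v, _mm_setzero_ps());  // 0xffffffff where v < 0
  v = _mm_andnot_ps(mask, v);                       // (~mask) & v
  _mm_storeu_ps(x, v);
  return _mm_movemask_ps(mask);                     // top bit of each field
}
```

The bitwise and-not clears exactly the fields where the comparison was true, which is why the result type is `__m128` rather than a scalar boolean.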

# **Conversion**

| Intrinsic Name   | Operation                                   | Corresponding SSE Instruction |
|------------------|---------------------------------------------|-------------------------------|
| _mm_cvtss_si32   | Convert to 32-bit integer                   | CVTSS2SI                      |
| _mm_cvtss_si64*  | Convert to 64-bit integer                   | CVTSS2SI                      |
| _mm_cvtps_pi32   | Convert to two 32-bit integers              | CVTPS2PI                      |
| _mm_cvttss_si32  | Convert to 32-bit integer (truncating)      | CVTTSS2SI                     |
| _mm_cvttss_si64* | Convert to 64-bit integer (truncating)      | CVTTSS2SI                     |
| _mm_cvttps_pi32  | Convert to two 32-bit integers (truncating) | CVTTPS2PI                     |
| _mm_cvtsi32_ss   | Convert from 32-bit integer                 | CVTSI2SS                      |
| _mm_cvtsi64_ss*  | Convert from 64-bit integer                 | CVTSI2SS                      |
| _mm_cvtpi32_ps   | Convert from two 32-bit integers            | CVTPI2PS                      |
| _mm_cvtpi16_ps   | Convert from four signed 16-bit integers    | composite                     |
| _mm_cvtpu16_ps   | Convert from four unsigned 16-bit integers  | composite                     |
| _mm_cvtpi8_ps    | Convert from four signed 8-bit integers     | composite                     |
| _mm_cvtpu8_ps    | Convert from four unsigned 8-bit integers   | composite                     |
| _mm_cvtpi32x2_ps | Convert from four 32-bit integers           | composite                     |
| _mm_cvtps_pi16   | Convert to four 16-bit integers             | composite                     |
| _mm_cvtps_pi8    | Convert to four 8-bit integers              | composite                     |
| _mm_cvtss_f32    | Extract                                     | composite                     |

## **Conversion**

```
float _mm_cvtss_f32(__m128 a)
```



```
float f;
f = _mm_cvtss_f32(a);
```

### Cast



Reinterprets the four single precision floating point values in a as four 32-bit integers, and vice versa.

#### No conversion is performed.

Makes integer shuffle instructions usable for floating point.
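
To make the difference between a cast and a conversion concrete, here is a small sketch. Both helper names are hypothetical.

```c
#include <emmintrin.h>  // SSE2 (casts, _mm_cvtps_epi32, _mm_cvtsi128_si32)

// Hypothetical helper: lowest 32-bit field after a cast (bits unchanged).
int low_bits_after_cast(float f) {
  __m128i bits = _mm_castps_si128(_mm_set1_ps(f));  // reinterpretation only
  return _mm_cvtsi128_si32(bits);
}

// Hypothetical helper: lowest 32-bit field after a numeric conversion.
int low_value_after_convert(float f) {
  __m128i ints = _mm_cvtps_epi32(_mm_set1_ps(f));   // CVTPS2DQ, rounds to integer
  return _mm_cvtsi128_si32(ints);
}
```

For f = 1.0f the cast yields the IEEE 754 bit pattern 0x3f800000, while the conversion yields the integer 1.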

# **Shuffles**

#### SSE

| Intrinsic Name  | Operation                               | Corresponding SSE Instruction |
|-----------------|-----------------------------------------|-------------------------------|
| _mm_shuffle_ps  | Shuffle                                 | SHUFPS                        |
| _mm_unpackhi_ps | Unpack High                             | UNPCKHPS                      |
| _mm_unpacklo_ps | Unpack Low                              | UNPCKLPS                      |
| _mm_move_ss     | Set low word, pass in three high values | MOVSS                         |
| _mm_movehl_ps   | Move High to Low                        | MOVHLPS                       |
| _mm_movelh_ps   | Move Low to High                        | MOVLHPS                       |
| _mm_movemask_ps | Create four-bit mask                    | MOVMSKPS                      |

#### SSE3

| Intrinsic Name  | Operation  | Corresponding SSE3 Instruction |
|-----------------|------------|--------------------------------|
| _mm_movehdup_ps | Duplicates | MOVSHDUP                       |
| _mm_moveldup_ps | Duplicates | MOVSLDUP                       |

#### SSSE3

| Intrinsic Name   | Operation | Corresponding SSSE3 Instruction |
|------------------|-----------|---------------------------------|
| _mm_shuffle_epi8 | Shuffle   | PSHUFB                          |
| _mm_alignr_epi8  | Shift     | PALIGNR                         |

#### SSE4

| Intrinsic Syntax                                                 | Operation                                                                                   | Corresponding SSE4 Instruction |
|------------------------------------------------------------------|---------------------------------------------------------------------------------------------|--------------------------------|
| \_\_m128 _mm_blend_ps(\_\_m128 v1, \_\_m128 v2, const int mask)  | Selects float single precision data from 2 sources using constant mask                      | BLENDPS                        |
| \_\_m128 _mm_blendv_ps(\_\_m128 v1, \_\_m128 v2, \_\_m128 v3)    | Selects float single precision data from 2 sources using variable mask                      | BLENDVPS                       |
| \_\_m128 _mm_insert_ps(\_\_m128 dst, \_\_m128 src, const int ndx) | Insert single precision float into packed single precision array element selected by index. | INSERTPS                       |
| int _mm_extract_ps(\_\_m128 src, const int ndx)                  | Extract single precision float from packed single precision array selected by index.        | EXTRACTPS                      |





\_MM\_SHUFFLE: helper macro to create the mask for \_mm\_shuffle\_ps
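
A short sketch of `_mm_shuffle_ps` with a mask built by `_MM_SHUFFLE`. The function name `shuffle_demo` and the argument letters (l, k, j, i) in the comment are chosen for this illustration.

```c
#include <xmmintrin.h>  // SSE

// Hypothetical helper: out = (a[0], a[2], b[1], b[3]).
void shuffle_demo(const float *a, const float *b, float *out) {
  __m128 va = _mm_loadu_ps(a);
  __m128 vb = _mm_loadu_ps(b);
  // _MM_SHUFFLE(l, k, j, i) encodes (l << 6) | (k << 4) | (j << 2) | i.
  // The result is (a[i], a[j], b[k], b[l]), listed from element 0 upward:
  // the two low elements always come from a, the two high ones from b.
  _mm_storeu_ps(out, _mm_shuffle_ps(va, vb, _MM_SHUFFLE(3, 1, 2, 0)));
}
```

Passing the same vector for both operands gives an arbitrary permutation or broadcast of one vector, for example `_MM_SHUFFLE(0, 1, 2, 3)` reverses it.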



# Example: Loading 4 Real Numbers from Arbitrary Memory Locations



7 instructions, this is the right way (before SSE4)

## **Code For Previous Slide**
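
The code for this slide did not survive in these notes. The following is a sketch of one instruction sequence that matches the count of seven (four scalar loads plus three shuffles); it is not necessarily the exact sequence shown in class, and the name `load_arbitrary` is hypothetical.

```c
#include <xmmintrin.h>  // SSE

// Gathers *p0..*p3 into one vector: 4 scalar loads + 3 shuffles.
__m128 load_arbitrary(const float *p0, const float *p1,
                      const float *p2, const float *p3) {
  __m128 t0 = _mm_load_ss(p0);          // (*p0, 0, 0, 0)
  __m128 t1 = _mm_load_ss(p1);          // (*p1, 0, 0, 0)
  __m128 t2 = _mm_load_ss(p2);          // (*p2, 0, 0, 0)
  __m128 t3 = _mm_load_ss(p3);          // (*p3, 0, 0, 0)
  __m128 lo = _mm_unpacklo_ps(t0, t1);  // (*p0, *p1, 0, 0)
  __m128 hi = _mm_unpacklo_ps(t2, t3);  // (*p2, *p3, 0, 0)
  return _mm_movelh_ps(lo, hi);         // (*p0, *p1, *p2, *p3)
}
```

The two unpack operations are independent of each other, so the dependency chain is one load, one unpack, and one final move.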

# Example: Loading 4 Real Numbers from Arbitrary Memory Locations (cont'd)

- Whenever possible avoid the previous situation
- Restructure algorithm and use the aligned \_mm\_load\_ps()
- Other possibility (should also yield 7 instructions, trusting the compiler)

```
__m128 vf;
vf = _mm_set_ps(*p3, *p2, *p1, *p0);
```

- SSE4: \_mm\_insert\_epi32 together with \_mm\_castsi128\_ps
  - Not clear whether better

# Example: Loading 4 Real Numbers from Arbitrary Memory Locations (cont'd)

Do not do this (why?):

```
__declspec(align(16)) float g[4];
__m128 vf;

g[0] = *p0;
g[1] = *p1;
g[2] = *p2;
g[3] = *p3;
vf = _mm_load_ps(g);
```

# **Example: Storing 4 Real Numbers to Arbitrary Memory Locations**



7 instructions, shorter critical path (before SSE4)
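
The diagram for this slide is missing here. A sketch of one sequence with seven instructions (three shuffles plus four scalar stores) follows; the name `store_arbitrary` is hypothetical and the sequence is an assumption, not necessarily the one from the slide.

```c
#include <xmmintrin.h>  // SSE

// Scatters the four elements of v: 3 shuffles + 4 scalar stores.
void store_arbitrary(__m128 v, float *p0, float *p1, float *p2, float *p3) {
  __m128 e1 = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1));  // v[1] in element 0
  __m128 e2 = _mm_movehl_ps(v, v);                            // v[2] in element 0
  __m128 e3 = _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 3, 3, 3));  // v[3] in element 0
  _mm_store_ss(p0, v);   // element 0 needs no shuffle
  _mm_store_ss(p1, e1);
  _mm_store_ss(p2, e2);
  _mm_store_ss(p3, e3);
}
```

Every shuffle depends only on v, so the critical path is a single shuffle followed by a store.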

```
__m128i _mm_alignr_epi8(__m128i a, __m128i b, const int n)
```

Concatenate a and b and extract byte-aligned result shifted to the right by n bytes

Example: View \_\_m128i as 4 32-bit ints; n = 12



Use with \_mm\_castsi128\_ps to do the same for floating point
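
A sketch of the n = 12 case on four 32-bit ints. The function name `alignr_demo` and the `SSSE3_FN` helper macro (a GCC/Clang attribute so the code builds without `-mssse3`) are assumptions of this illustration.

```c
#include <tmmintrin.h>  // SSSE3 (_mm_alignr_epi8)

// Assumption: GCC/Clang function attribute; with other compilers enable
// SSSE3 through the build flags instead.
#if defined(__GNUC__)
#define SSSE3_FN __attribute__((target("ssse3")))
#else
#define SSSE3_FN
#endif

// Hypothetical demo: out = (b[3], a[0], a[1], a[2]).
SSSE3_FN void alignr_demo(const int *a, const int *b, int *out) {
  __m128i va = _mm_loadu_si128((const __m128i *)a);
  __m128i vb = _mm_loadu_si128((const __m128i *)b);
  // a forms the upper and b the lower 16 bytes of a 32-byte value;
  // shifting right by 12 bytes drops b[0], b[1], b[2].
  _mm_storeu_si128((__m128i *)out, _mm_alignr_epi8(va, vb, 12));
}
```

The shift amount must be a compile-time constant.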

# **Example**

```
void shift(float *x, float *y, int n) {
  for (int i = 0; i < n-1; i++)
    y[i] = x[i+1];
  y[n-1] = 0;
}
```

```
#include <tmmintrin.h> // SSSE3 (_mm_alignr_epi8)

// n a multiple of 4, x, y are 16-byte aligned
void shift_vec(float *x, float *y, int n) {
  __m128 f;
  __m128i i1, i2;

  i1 = _mm_castps_si128(_mm_load_ps(x));            // load first 4 floats and cast to int
  for (int i = 0; i < n-4; i = i + 4) {
    i2 = _mm_castps_si128(_mm_load_ps(x+4+i));      // load next 4 floats and cast to int
    f = _mm_castsi128_ps(_mm_alignr_epi8(i2,i1,4)); // shift and extract and cast back
    _mm_store_ps(y+i,f);                            // store it
    i1 = i2;                                        // make 2nd element 1st
  }
  // we are at the last 4
  i2 = _mm_setzero_si128();                         // shift in a zero for y[n-1]
  f = _mm_castsi128_ps(_mm_alignr_epi8(i2,i1,4));   // shift and extract and cast back
  _mm_store_ps(y+n-4,f);                            // store it
}
```

# **Vectorization**




```
__m128i _mm_shuffle_epi8(__m128i a, __m128i mask)
```

Result is filled in each position by any element of a or with 0, as specified by mask

Example: View \_\_m128i as 4 32-bit ints



Use with \_mm\_castsi128\_ps to do the same for floating point
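
As an illustration of the byte-level mask, the following sketch reverses four floats. The function name `reverse_floats` and the `SSSE3_FN` helper macro (a GCC/Clang attribute so the code builds without `-mssse3`) are assumptions of this example.

```c
#include <tmmintrin.h>  // SSSE3 (_mm_shuffle_epi8)

// Assumption: GCC/Clang function attribute; with other compilers enable
// SSSE3 through the build flags instead.
#if defined(__GNUC__)
#define SSSE3_FN __attribute__((target("ssse3")))
#else
#define SSSE3_FN
#endif

// Hypothetical helper: out = (x[3], x[2], x[1], x[0]).
SSSE3_FN void reverse_floats(const float *x, float *out) {
  __m128i v = _mm_castps_si128(_mm_loadu_ps(x));
  // Byte i of the result is byte mask[i] of v (or 0 if mask[i] has its top
  // bit set). _mm_set_epi8 lists the bytes from position 15 down to 0.
  __m128i mask = _mm_set_epi8(3, 2, 1, 0, 7, 6, 5, 4,
                              11, 10, 9, 8, 15, 14, 13, 12);
  _mm_storeu_ps(out, _mm_castsi128_ps(_mm_shuffle_epi8(v, mask)));
}
```

Unlike `_mm_shuffle_ps`, the mask is a register operand, so it may be computed at run time.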

**(SSE4)** \_mm\_blend\_ps(a, b, mask): Each position of the result is filled with the element of a or b from the same position, as specified by mask

**Example:** mask = 2 = 0010
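
A sketch of the mask = 2 case with `_mm_blend_ps`. The function name `blend_demo` and the `SSE41_FN` helper macro (a GCC/Clang attribute so the code builds without `-msse4.1`) are assumptions of this illustration.

```c
#include <smmintrin.h>  // SSE4.1 (_mm_blend_ps)

// Assumption: GCC/Clang function attribute; with other compilers enable
// SSE4.1 through the build flags instead.
#if defined(__GNUC__)
#define SSE41_FN __attribute__((target("sse4.1")))
#else
#define SSE41_FN
#endif

// Hypothetical demo: out = (a[0], b[1], a[2], a[3]).
SSE41_FN void blend_demo(const float *a, const float *b, float *out) {
  __m128 va = _mm_loadu_ps(a);
  __m128 vb = _mm_loadu_ps(b);
  // Bit i of the mask set: take element i from b, otherwise from a.
  // 2 = 0010 takes only element 1 from b.
  _mm_storeu_ps(out, _mm_blend_ps(va, vb, 2));
}
```

Elements never change position in a blend, which distinguishes it from a shuffle.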



```
_MM_TRANSPOSE4_PS(row0, row1, row2, row3)
```

**Macro for 4 x 4 matrix transposition:** The arguments row0, ..., row3 are \_\_m128 values each containing a row of a 4 x 4 matrix. After execution, row0, ..., row3 contain the columns of that matrix.



In SSE: 8 shuffles (4 \_mm\_unpacklo\_ps, 4 \_mm\_unpackhi\_ps)
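
A minimal usage sketch of the macro, assuming a row-major matrix stored in 16 consecutive, 16-byte-aligned floats. The function name `transpose4x4` is hypothetical.

```c
#include <xmmintrin.h>  // SSE (_MM_TRANSPOSE4_PS)

// Hypothetical helper: transposes a row-major 4 x 4 matrix in place.
// m must be 16-byte aligned.
void transpose4x4(float *m) {
  __m128 row0 = _mm_load_ps(m);
  __m128 row1 = _mm_load_ps(m + 4);
  __m128 row2 = _mm_load_ps(m + 8);
  __m128 row3 = _mm_load_ps(m + 12);
  _MM_TRANSPOSE4_PS(row0, row1, row2, row3);  // macro: overwrites its arguments
  _mm_store_ps(m, row0);
  _mm_store_ps(m + 4, row1);
  _mm_store_ps(m + 8, row2);
  _mm_store_ps(m + 12, row3);
}
```

Because it is a macro that assigns to its arguments, the four operands must be modifiable variables, not expressions.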

# **Example: 4 x 4 Matrix-Vector Product**



**Blackboard** 
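
The blackboard derivation is not reproduced in these notes. One possible SSE-only formulation is sketched below as an assumption, not necessarily the version developed in class: multiply each row by x, transpose the four product vectors, and add them. The name `matvec4` is hypothetical.

```c
#include <xmmintrin.h>  // SSE

// y = A * x for a row-major 4 x 4 matrix A; A, x, y are 16-byte aligned.
void matvec4(const float *A, const float *x, float *y) {
  __m128 vx = _mm_load_ps(x);
  __m128 p0 = _mm_mul_ps(_mm_load_ps(A), vx);       // A[0][j] * x[j]
  __m128 p1 = _mm_mul_ps(_mm_load_ps(A + 4), vx);   // A[1][j] * x[j]
  __m128 p2 = _mm_mul_ps(_mm_load_ps(A + 8), vx);   // A[2][j] * x[j]
  __m128 p3 = _mm_mul_ps(_mm_load_ps(A + 12), vx);  // A[3][j] * x[j]
  // After the transposition, p_j holds (A[0][j]*x[j], ..., A[3][j]*x[j]).
  _MM_TRANSPOSE4_PS(p0, p1, p2, p3);
  __m128 vy = _mm_add_ps(_mm_add_ps(p0, p1), _mm_add_ps(p2, p3));
  _mm_store_ps(y, vy);                              // y[i] = sum_j A[i][j] * x[j]
}
```

This version uses 4 multiplications, 8 shuffles, and 3 additions; the shuffles are pure overhead compared with the scalar code.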

## Other Intrinsics

- Logical intrinsics (bitwise and, or, ...)
- Cacheability support intrinsics
  - Prefetch:
     void \_mm\_prefetch(char const \*a, int sel)
  - Stores that bypass the cache:
    void \_mm\_stream\_ps(float \*p, \_\_m128 a)
- Others

# **Vectorization With Intrinsics: Key Points**

- Use aligned loads and stores
- Minimize overhead (shuffle instructions) = maximize vectorization efficiency
- Definition: Vectorization efficiency

$$\text{efficiency} = \frac{\text{op count of scalar (unvectorized) code}}{\text{op count (including shuffles) of vectorized code}}$$

- *Ideally:* Efficiency = v for v-way vector instructions
  (assumes no vector instruction does more than 4 scalar ops)
- Examples (blackboard):
  - Adding two vectors of length 4
  - 4 x 4 matrix-vector multiplication
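
A worked illustration of the definition, as a sketch under stated assumptions since the blackboard versions are not part of these notes. Adding two vectors of length 4 takes 4 scalar additions or a single 4-way vector addition. For the 4 x 4 matrix-vector product, the scalar code needs 16 multiplications and 12 additions; assume a vectorized scheme that multiplies each row by x (4 multiplications), transposes the products (8 shuffles), and sums the four resulting vectors (3 additions). Other schemes give different counts.

$$\text{efficiency}_{\text{add}} = \frac{4}{1} = 4, \qquad \text{efficiency}_{\text{matvec}} = \frac{16 + 12}{4 + 8 + 3} = \frac{28}{15} \approx 1.87$$

The vector addition reaches the ideal value of 4, while the shuffles of the assumed matrix-vector scheme cost more than half of the possible gain.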